fix: attach on macOS framework Python (Homebrew, python.org) #31

Merged: bppps merged 1 commit into alibaba:main from sunmiaozju:fix/macos-attach on May 18, 2026
@sunmiaozju sunmiaozju commented May 18, 2026

Background

flight_profiler <pid> fails to attach on macOS:

  • lldb path (CPython ≤ 3.13): fails with test find take_gil function failed
  • sys.remote_exec path (CPython ≥ 3.14): fails with flight_profiler and target process are not in the same python environment!

Both failures reproduce reliably.

Repro environment

  • macOS 15.1.1, Darwin 24.1.0, host arch arm64

  • Two common Python installations:

    • Homebrew Python 3.10.6 x86_64 (/usr/local/Cellar/python@3.10/...)
    • Homebrew Python 3.14.2 arm64 (/opt/homebrew/Cellar/python@3.14/...)
  • Target process: a simple sleep-and-print loop:

    # /tmp/sleep_loop.py
    import os
    import time

    # Print the PID first so it can be passed to flight_profiler
    print(f"sleep_loop pid={os.getpid()}", flush=True)
    i = 0
    while True:
        print(f"tick {i}", flush=True)
        time.sleep(1)
        i += 1
  • Repro command: flight_profiler <pid> --debug --cmd "help"

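Root cause

The README's statement that "PyFlightProfiler must reside in the same Python environment" is a real hard constraint. This PR does not weaken it; it only fixes two implementation bugs that caused a same-environment setup to be misjudged, or the symbol lookup to fail.

Key fact: in a macOS framework Python, "one Python environment" does not correspond to a single file

A Python installed via Homebrew or python.org places at least the following on disk side by side:

| Path | Purpose |
| --- | --- |
| …/Python.framework/Versions/X.Y/Resources/Python.app/Contents/MacOS/Python | A launcher of a few tens of KB; this is the file the process actually execs (and the one lsof -p shows); it contains no Python C API symbols |
| …/Python.framework/Versions/X.Y/Python | The framework dylib, a few MB in size; take_gil and the other Python C symbols live here |
| …/bin/pythonX.Y | Wrapper / symlink that ultimately points to the launcher |
| bin/python inside a venv | One more symlink layer, pointing to bin/pythonX.Y |

These four paths belong to the same Python environment, but they are different files. Both bugs below arise because the old code did not handle this correctly.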


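Bug 1 (sys.remote_exec path): the two sides of the same-environment check used different resolution methods

The script involved is flight_profiler/shell/mac/py_bin_base_addr_locate.sh. It compares the client's and the server's Python binary paths and aborts if they differ. The old code, however, obtained the path differently on each side:

| Side | Resolution method | Actual value on our machine |
| --- | --- | --- |
| server | lsof -p <server_pid>, txt entry → the binary the process actually runs | …/Python.app/Contents/MacOS/Python (launcher) |
| client (old) | sys.executable → realpath → follows symlinks down to the bin layer | …/bin/python3.14 |
| client (new) | lsof -p $$, also taking the txt entry | …/Python.app/Contents/MacOS/Python ✅ matches the server |

The old implementation compared these two values with strict string equality. Even when client and server were started by the same interpreter, the two sides resolved the path differently, so the compared strings could never be equal: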

sys.executable -> realpath -> .../Versions/3.14/bin/python3.14
lsof  txt                   -> .../Versions/3.14/Resources/Python.app/Contents/MacOS/Python

→ Even a same-environment pair is judged "not in the same python environment", and attach aborts.
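The mismatch can be reproduced without a real framework install. The following is a minimal sketch that builds a throwaway directory tree shaped like the layout above and compares the two resolutions. It assumes that bin/python3.14 is a regular wrapper file (consistent with realpath stopping there in the output above) and that the venv's bin/python is a symlink to it; the directory names and the helper resolve_like_old_client are illustrative, not taken from the project.

```python
import os
import tempfile


def resolve_like_old_client(executable: str) -> str:
    """Mimic the old client side: follow symlinks starting from sys.executable."""
    return os.path.realpath(executable)


with tempfile.TemporaryDirectory() as root:
    # Temp directories can themselves sit behind a symlink, so normalise first.
    root = os.path.realpath(root)
    version_dir = os.path.join(root, "Python.framework", "Versions", "3.14")
    launcher = os.path.join(
        version_dir, "Resources", "Python.app", "Contents", "MacOS", "Python"
    )
    wrapper = os.path.join(version_dir, "bin", "python3.14")
    venv_python = os.path.join(root, "venv", "bin", "python")

    for path in (launcher, wrapper, venv_python):
        os.makedirs(os.path.dirname(path), exist_ok=True)

    # Assumption: launcher and wrapper are regular files, venv python is a symlink.
    open(launcher, "w").close()
    open(wrapper, "w").close()
    os.symlink(wrapper, venv_python)

    client_path = resolve_like_old_client(venv_python)  # ends at bin/python3.14
    server_path = launcher  # what lsof would report for the running target
    same_environment = client_path == server_path

    print("client:", client_path)
    print("server:", server_path)
    print("strict string compare says same environment:", same_environment)
```

Both paths live under the same Versions/3.14 directory, yet the strict comparison fails, which is the behaviour described above.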

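Evidence of the self-contradiction: the diagnostic printed by client.py:show_pre_attach_info(), Verify pyFlightProfiler and target are using the same python executable: 🌟, itself uses get_py_bin_path() (i.e. lsof) to obtain both paths. In other words, the previous code behaved as follows:

  • The diagnostic reports "same environment ✓"
  • The hard comparison performed by attach immediately afterwards uses a different path resolution
  • As a result, "same environment ✓" and "not in the same python environment ❌" appear in the same run

The new implementation changes the client path that client.py:get_base_addr passes to the shell script from str(sys.executable) to get_py_bin_path(os.getpid()), so the attach comparison and the existing diagnostic resolve paths the same way. The rule is unchanged (the two binaries must still be equal); only the way it is checked becomes consistent and trustworthy.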
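The idea of resolving both sides the same way can be sketched as follows. This is not the project's get_py_bin_path; it is a hypothetical parser for the default lsof -p <pid> table (columns COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME) that returns the first row whose FD is txt. The sample output and the paths in it are made up for illustration.

```python
from typing import Optional


def first_txt_path(lsof_output: str) -> Optional[str]:
    """Return the NAME of the first row whose FD column is 'txt'."""
    for line in lsof_output.splitlines():
        # NAME is the 9th column and may contain spaces, so split at most 8 times.
        fields = line.split(None, 8)
        if len(fields) == 9 and fields[3] == "txt":
            return fields[8]
    return None


def same_python_binary(client_lsof: str, server_lsof: str) -> bool:
    """Strict equality, but with both sides resolved the same way."""
    client_bin = first_txt_path(client_lsof)
    server_bin = first_txt_path(server_lsof)
    return client_bin is not None and client_bin == server_bin


HEADER = "COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME"


def sample_lsof(pid: int, binary: str) -> str:
    """Build illustrative lsof-style output (made-up values, not a real log)."""
    return (
        f"{HEADER}\n"
        f"Python {pid} user cwd DIR 1,4 640 2 /tmp\n"
        f"Python {pid} user txt REG 1,4 34000 99 {binary}\n"
    )


LAUNCHER_314 = "/fw314/Versions/3.14/Resources/Python.app/Contents/MacOS/Python"
LAUNCHER_310 = "/fw310/Versions/3.10/Resources/Python.app/Contents/MacOS/Python"

print(same_python_binary(sample_lsof(1, LAUNCHER_314), sample_lsof(2, LAUNCHER_314)))
print(same_python_binary(sample_lsof(1, LAUNCHER_314), sample_lsof(2, LAUNCHER_310)))
```

Because both values come from the same kind of source, the equality test stays strict: a genuinely different interpreter on the other side still fails the comparison.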

Verifying that safety is not weakened: using flight_profiler from a 3.14 arm64 venv to attach to a 3.10 x86_64 target process (clearly cross-environment) is still rejected by the new code:

[DEBUG] Server Python Executable: /usr/local/.../python@3.10/.../MacOS/Python
[DEBUG] Client Python Executable: /opt/homebrew/.../python@3.14/.../MacOS/Python
[INFO] Verify pyFlightProfiler and target are using the same python executable: ❌
flight_profiler and target process are not in the same python environment!

The cross-environment check still works as expected.


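Bug 2 (lldb path): nm looked up symbols in the wrong file

This bug is unrelated to the same-environment check; it is purely a symbol lookup path problem.

flight_profiler/shell/resolve_symbol.sh uses nm to find the offset of take_gil. The old code ran nm on the binary returned by resolve_bin_path.sh, which is the launcher: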

$ nm .../Python.app/Contents/MacOS/Python | grep take_gil
(empty)

$ nm .../Python.framework/Versions/X.Y/Python | grep take_gil
000000000015c5be t _take_gil

So running nm on the launcher never finds the symbol → resolve_symbol.sh reports invalid python process $pid, test find take_gil function failed → attach aborts.
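To make the failure mode concrete, here is a small sketch of the lookup step expressed in Python rather than shell. The function symbol_offset is hypothetical; it parses nm-style lines of the form "<hex address> <type> <name>". The dylib line is the one shown above, and the launcher input is empty because grep take_gil returned nothing for it.

```python
from typing import Optional


def symbol_offset(nm_output: str, symbol: str) -> Optional[int]:
    """Return the address of `symbol` from nm-style output, or None if absent."""
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols look like: "<hex address> <type letter> <name>"
        if len(parts) == 3 and parts[2] == symbol:
            return int(parts[0], 16)
    return None


launcher_nm = ""  # nothing survives `grep take_gil` for the launcher
dylib_nm = "000000000015c5be t _take_gil\n"

print(symbol_offset(launcher_nm, "_take_gil"))
print(hex(symbol_offset(dylib_nm, "_take_gil")))
```

A None result here corresponds to the "test find take_gil function failed" abort: the lookup logic is fine, it is simply pointed at a file that does not define the symbol.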

Why not simply make resolve_bin_path.sh return the framework dylib as well? Because lldb needs the launcher:

lldb $py_bin_path
(lldb) process attach -p $pid

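If $py_bin_path is the framework dylib (not the binary that was actually exec'd), lldb reports error: no error returned from Target::Attach, and target has no process. We reproduced this ourselves: it is a regression that appeared when an earlier community attempt to fix Bug 2 changed resolve_bin_path.sh to redirect the launcher to the dylib.

The correct approach is to separate responsibilities: the launcher is used for attach, the framework dylib for symbol lookup. This PR therefore:

  • Leaves resolve_bin_path.sh unchanged (it keeps returning the binary that is really running, i.e. the launcher)
  • Adds a fallback branch to resolve_symbol.sh that maps a macOS framework launcher to the dylib of the same framework; only the nm query takes this fallback

These two files already belong to the same Python environment, so the fallback introduces no cross-environment risk.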
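The fallback itself lives in a shell script, but the path mapping it performs can be sketched in Python. The function name symbol_source_for and the example paths are hypothetical, and the real script may apply additional checks (for example, that the dylib exists) that are not modelled here.

```python
from pathlib import PurePosixPath

# Trailing path components that identify a framework launcher.
LAUNCHER_SUFFIX = ("Resources", "Python.app", "Contents", "MacOS", "Python")


def symbol_source_for(bin_path: str) -> str:
    """Map a framework launcher to its sibling dylib; leave other paths alone."""
    parts = PurePosixPath(bin_path).parts
    n = len(LAUNCHER_SUFFIX)
    is_launcher = (
        len(parts) >= n + 2
        and parts[-n:] == LAUNCHER_SUFFIX
        and parts[-n - 2] == "Versions"  # .../Versions/X.Y/Resources/...
    )
    if not is_launcher:
        return bin_path
    # Drop the launcher suffix, keeping .../Versions/X.Y, then append "Python".
    return str(PurePosixPath(*parts[:-n]) / "Python")


launcher = (
    "/opt/example/Python.framework/Versions/3.14/"
    "Resources/Python.app/Contents/MacOS/Python"
)
print(symbol_source_for(launcher))            # dylib used only for nm
print(symbol_source_for("/usr/bin/python3"))  # non-framework path is untouched
```

Keeping this as a separate mapping, applied only on the symbol lookup side, means the path handed to lldb is never rewritten.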

Fix

Only two files change, with 22 lines added:

  • flight_profiler/client.py: get_base_addr now passes get_py_bin_path(os.getpid()), so client and server use the same lsof-based path resolution instead of sys.executable
  • flight_profiler/shell/resolve_symbol.sh: when symbol_bin_path matches the …/Python.app/Contents/MacOS/Python pattern, nm is redirected to the …/Versions/X.Y/Python dylib in the same framework. resolve_bin_path.sh is untouched, so lldb still gets the correct launcher for process attach -p

Design principle: on macOS framework Python, the launcher and the framework dylib serve two different purposes (attach uses the launcher, symbol lookup uses the dylib); do not merge them into a single resolution function.

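Verification

| Path | Python | Arch | Result |
| --- | --- | --- | --- |
| sys.remote_exec | 3.14.2 | arm64 | attach OK; getglobal __main__ i returns a live, steadily increasing counter (PEP 768 on macOS still needs sudo to satisfy task_for_pid, which is unrelated to this PR) |
| lldb | 3.10.6 | x86_64 | attach OK (no sudo needed, same UID); getglobal __main__ i behaves the same |
| Cross-environment (client 3.14 arm64 → target 3.10 x86_64) | | | The check still reports ❌ and attach aborts (showing that the same-environment constraint has not been weakened) |

Steps to reproduce the verification: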

# Target
python3 /tmp/sleep_loop.py & PID=$!

# Attach + smoke test
flight_profiler $PID --cmd "help"
flight_profiler $PID --cmd "getglobal __main__ i"

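Before the fix, the --debug output stops at flight_profiler and target process are not in the same python environment! or at test find take_gil function failed; after the fix, both the lldb and the sys.remote_exec paths complete injection and respond to commands.

Out of scope

  • Whether flight_profiler/lib/flight_profiler_agent.dylib ships with the wheel in the installed artifact is a separate issue (the dylib must first be produced by make build); it is unrelated to this PR and can be handled in a follow-up PR.
  • The sudo requirement for Python ≥ 3.14 on macOS comes from the PEP 768 / task_for_pid system constraint; it is not a bug.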

On macOS framework Pythons, the running executable is the small
Python.app/Contents/MacOS/Python launcher (no Python symbols), while
the Python C-API symbols live in the sibling framework dylib at
Python.framework/Versions/X.Y/Python. Both attach paths broke on this:

- sys.remote_exec path (CPython >= 3.14): client.py:get_base_addr passed
  sys.executable (resolves to bin/pythonX.Y) to py_bin_base_addr_locate.sh,
  which compared it against the lsof-derived launcher path on the server
  side. The strict string compare aborted with "not in the same python
  environment" even when client and target shared the same interpreter.
- lldb path (CPython <= 3.13): resolve_bin_path.sh correctly returns the
  launcher (lldb needs it for `process attach -p` to succeed), but
  resolve_symbol.sh ran nm against the launcher and found no take_gil
  symbol, so attach aborted with "test find take_gil function failed".

Fix:
- client.py: pass get_py_bin_path(os.getpid()) to py_bin_base_addr_locate.sh
  so the client side uses the same lsof resolution as the server side.
- resolve_symbol.sh: when symbol_bin_path points at the framework launcher,
  redirect nm to the sibling framework dylib. resolve_bin_path.sh stays
  unchanged so lldb still attaches to the real running executable.

Verified on macOS 15.1.1 arm64:
- Python 3.14.2 arm64 (sys.remote_exec, with sudo for task_for_pid):
  attach OK, getglobal __main__ i returns live counter.
- Python 3.10.6 x86_64 (lldb path, no sudo): attach OK, getglobal
  __main__ i returns live counter.
@sunmiaozju sunmiaozju requested a review from bppps as a code owner May 18, 2026 04:46
@bppps bppps merged commit 9487291 into alibaba:main May 18, 2026
2 checks passed